Topic Labeling of Multilingual Broadcast News in the Informedia Digital Video Library
Authors: Alexander G. Hauptmann, Danny Lee and Paul E. Kennedy
Abstract
The Informedia Digital Video Library Project includes a multilingual component for retrieval of video documents in multiple languages and a topic-labeling component for English video documents. We now extend this capability to English topic labeling of foreign-language broadcast-news stories. News stories are coarsely machine-translated into English, then assigned to a topic category using a K-nearest-neighbor algorithm. In preliminary tests on Croatian television news, topic assignment based on the best available machine translation technology showed performance only 8% worse (on a standard F-measure of performance) than that based on manual document translation. Using a phrase-based MT module, the performance degradation was 31%.

1 The Informedia Digital Video Library

The Informedia Digital Library Project [1,2] allows full content indexing and retrieval of text, audio and video material, similar to what is available today for text only. To enable this access to video, speech recognition provides a text transcript of the audio track, and image processing determines scene boundaries, recognizes faces and allows for image similarity comparisons. Everything is indexed into a searchable digital video library [4,6], where users can submit queries and retrieve relevant news stories as results.

News-on-Demand is a particular collection in the Informedia Digital Library that has served as a test-bed for automatic library creation techniques. As of July 1998, the Informedia project had about 1.3 terabytes of news video indexed and accessible online, with 1200 news broadcasts containing 24000 news stories.

The Informedia digital video library system has two distinct subsystems: the Library Creation System and the Library Exploration Client. The library creation system runs every night, automatically capturing, processing and adding current news shows to the library.
It is during the library creation phase that topics are automatically assigned to incoming news stories. During library exploration, the user can browse or search these stories and topics using the library exploration client. In [17], we described and evaluated a topic labeling component for the English-language version of the Informedia Digital Video Library. At 5 topics, the KNN-based system's recall was 0.49 and relevance was 0.48, with an F-measure at equal recall and precision of about 0.48.

2 Related Research on Topic Detection

The work reported here is similar in spirit to an approach reported by Schwartz [4], who classified news stories into a static set of topics using a Hidden Markov Model approach and found it to be somewhat better than a naïve Bayesian approach. Yang [7] also reports on other techniques, which cluster news stories into groups with similar topic content. This work differs in that the topic categories here are defined a priori and do not change with different data sets. We felt a fixed set of categories would better reflect user needs than a clustering approach, which could yield different clusters on different days, depending on the contents of the corpus. We are extending the topic detection work and applying it in combination with machine translation techniques.

3 Multilingual Informedia

The Multilingual Informedia Project demonstrates a seamless extension of the Informedia approach to search and discovery across video documents in multiple languages. The new system performs speech recognition on foreign-language news broadcasts, segments them into stories and indexes the foreign data together with English news data from English-language sources.

3.1 The Components of Multi-Lingual Informedia

There are three components in the Multilingual Informedia System [19] that differ significantly from the original Informedia system: The speech recognizer recognizes a foreign language, specifically Croatian [9,10,11].
This component will not be described here.

In the first multi-lingual Informedia system, translingual broadcast retrieval was enabled by machine translation of an English query into the language(s) of the broadcasts, in our case Croatian. This enabled a search for equivalent words in a joint corpus of English and Croatian news broadcasts. A phrase-based translation module, described in [19], provided the machine-translation capability. This translation module was also used to translate the complete broadcasts into English for some of the topic-detection experiments reported in this paper. In addition, for some of the current experiments, a version of the example-based machine translation system DIPLOMAT [18, 15] was used for “high-quality” translations of the news stories.

English topic labels for the foreign-language news stories allow a user to identify a relevant story in the target language. This paper focuses on this foreign-language news topic classification component.

3.2 The Informedia Translation Facility

The current version of the translation facility attempts to translate phrases it finds in a source-language text. The facility takes advantage of multi-word phrase entries in a machine-readable dictionary [16]. It uses a recursive procedure to search for dictionary entries corresponding to progressively smaller chunks of the input. The target-language equivalents of the chunks it finds are concatenated to form the output string. In general, this text-translation facility will work with any language pair, as long as a bilingual machine-readable dictionary is available in the format the program understands.

The DIPLOMAT example-based machine translation system, developed here at Carnegie Mellon University, was also used for “high-quality” story translation from Croatian into English.
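The recursive chunk lookup described in Section 3.2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Informedia implementation: the function names `translate_tokens` and `translate_text`, the toy dictionary entries, and the choice to pass unknown words through untranslated are all hypothetical.

```python
def translate_tokens(tokens, dictionary):
    """Translate a token list by recursive longest-chunk dictionary lookup.

    Assumption: `dictionary` maps space-joined source phrases to target
    phrases. The real dictionary format is not specified in the text.
    """
    if not tokens:
        return []
    # Try progressively smaller chunks: the whole input first, then every
    # shorter window, down to single words.
    for size in range(len(tokens), 0, -1):
        for start in range(len(tokens) - size + 1):
            chunk = " ".join(tokens[start:start + size])
            if chunk in dictionary:
                # Recurse on the material to the left and right of the match.
                left = translate_tokens(tokens[:start], dictionary)
                right = translate_tokens(tokens[start + size:], dictionary)
                return left + [dictionary[chunk]] + right
    # No chunk of any size has an entry: keep the words untranslated
    # (an assumption; the source does not say how unknown words are handled).
    return list(tokens)


def translate_text(text, dictionary):
    """Concatenate the target-language equivalents into one output string."""
    return " ".join(translate_tokens(text.split(), dictionary))


# Toy entries for illustration only.
TOY_DICTIONARY = {
    "dobar dan": "good afternoon",
    "dobar": "good",
    "dan": "day",
    "vijesti": "news",
}
```

With the toy dictionary above, `translate_text("dobar dan vijesti", TOY_DICTIONARY)` matches the two-word entry before the single-word ones and yields `"good afternoon news"`.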
4 Foreign Language Topic Detection

After initial experiments with the Croatian news processed by the Multilingual Informedia system [19], it became clear that returning a foreign-language result to the user was not sufficient. Users were unable to tell whether a particular news clip was actually relevant to their query, or whether it was returned due to poor query translation or inadequate information retrieval techniques. To allow the user at least some judgment about the returned stories, we attempted to label each Croatian news story with an English-language topic.

The topic identification was done using the Informedia translation facility to translate the whole story into English words. This translation became the topic query. Separately, we had indexed about 35000 English-language news stories with manually assigned topics. Using the SMART information retrieval system, we used the translated topic query to retrieve the 10 most relevant labeled English stories. Each of the topics of the retrieved stories was weighted by the story's relevance to the topic query, and the weights for each topic were summed. The most favored topics, above a threshold, were then used to provide a topic label for the Croatian news story. This topic label allows the user to identify the general topic area of an otherwise incomprehensible foreign-language text and to determine whether it is relevant, at least in the topic area.
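The labeling procedure in Section 4 (retrieve the nearest labeled stories, sum relevance weights per topic, keep topics above a threshold) can be sketched as below. This is an illustrative sketch only: a simple TF-IDF cosine similarity stands in for the SMART retrieval system, and the function names, the IDF smoothing, and the threshold value of 0.5 are assumptions, since the source does not give them.

```python
import math
from collections import Counter, defaultdict


def build_index(stories):
    """Build TF-IDF vectors for labeled stories given as (tokens, topics) pairs."""
    n = len(stories)
    df = Counter(term for tokens, _ in stories for term in set(tokens))
    # Smoothed IDF; a stand-in weighting, not the SMART scheme.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens, _ in stories]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def label_story(query_tokens, stories, k=10, threshold=0.5):
    """Return (topic, summed_weight) pairs above a hypothetical threshold."""
    vectors, idf = build_index(stories)
    # Terms unseen in the labeled corpus get zero weight.
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scored = [(cosine(query, vec), topics)
              for vec, (_, topics) in zip(vectors, stories)]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    totals = defaultdict(float)
    for similarity, topics in nearest:
        for topic in topics:
            totals[topic] += similarity  # weight each topic by story relevance
    kept = [(t, w) for t, w in totals.items() if w >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy labeled corpus for illustration only.
TOY_STORIES = [
    (["election", "vote", "parliament"], ["politics"]),
    (["vote", "president", "election"], ["politics"]),
    (["football", "match", "goal"], ["sports"]),
    (["goal", "league", "football"], ["sports"]),
]
```

Here the machine-translated story plays the role of `query_tokens`. Summing similarities rather than counting votes means a topic shared by a few highly relevant neighbors can outrank one shared by many weakly relevant ones, which mirrors the weighting the text describes.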
Similar resources
Topic Labeling of Multilingual Broadcast News in the Informedia Digital Video Library
Alexander G. Hauptmann, Danny Lee and Paul E. Kennedy. Abstract: The Informedia Digital Video Library Project includes a multilingual component for retrieval of video documents in multiple languages and a topic-labeling component for English video documents. We now extend this capability to English topic labeling of foreign-language broadcast-news stories. News st...
Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts
This paper describes the creation and content of two corpora, TDT-2 and TDT-3, created for the DARPA-sponsored Topic Detection and Tracking project. The research goal in the TDT program is to create the core technology of a news understanding system that can process multilingual news content, categorizing individual stories according to the topic(s) they describe. The research tasks include segment...
Online Story Segmentation of Multilingual Streaming Broadcast News
We present an online story segmentation approach for Broadcast News (BN) that is built upon and integrated into BBN COTS multilingual Broadcast Monitoring System (BMS). We take a discriminative model-based approach, using Support Vector Machines to segment BN transcriptions into thematically coherent stories within the real-time constraints defined by BMS. We extract lexical, topical and story ...
Transcribing Multilingual Broadcast News Using Hypothesis Driven Lexical Adaptation
This paper describes first results of our DARPA-sponsored efforts toward recognizing and browsing foreign-language, more specifically Serbo-Croatian, broadcast news. For Serbo-Croatian, as for many languages other than the most common, well-studied ones, the problems of broadcast-quality recognition are complicated by 1) the lack of available acoustic and language data, and 2) the excessive voca...
The need to create a media block for the convergence of overseas news networks
As a general diplomacy arm of the Islamic Republic of Iran, VoSiMa has extensive activities in international broadcasting of its radio and television programs. These programs are broadcast in different languages, such as English, French, Azeri, Arabic, and ... for regional and transnational audiences. The large volume of the organization's international activities is in the form of news and new...